[API-359] Adds user score indexer #488
raymondjacobson
left a comment
i don't love this honestly, but i'm down to release it and see how it performs.
it's a lot of manual work to track if we end up with lots of jobs like these.
wondering if instead, when we re-do indexing, we do something more "streaming"-like where we watch for certain things and then run one-off queries against individual users. but would have to put more thought into that.
@rickyrombo may have a lot of thoughts here, but I am generally in favor of moving quickly, trying things, and then course correcting if they don't pan out
I don't love it either. There's a lot of downside to not updating scores immediately when the conditions affecting them change. But adding triggers in all the right places also feels fraught. Adding a new feature to the score calculation means making sure we find all the places where that gets changed and triggering a score calculation. We could probably get close enough by leaning on triggers on a bunch of tables. But triggers are also causing us more and more headaches the more we use them 🤷. I can spend a little more time and see if I can get the update query to work efficiently with small batches of users, so we can just loop on that until it's done and throw out all the multi-replica logic. Then it's just the same thing we're doing today, only a little slower to finish the cycle so that it doesn't block indexing writes.
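To make the "small batches, loop until done" idea concrete, here is a minimal sketch. It is not the code in this PR: `compute_scores` and `write_scores` are hypothetical callables standing in for the per-batch score query and the write to `aggregate_user`, and the batch size of 1000 is an arbitrary assumption.

```python
from typing import Callable, Dict, Iterator, Sequence


def batched(ids: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of at most `size` ids."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def run_score_cycle(
    user_ids: Sequence[int],
    compute_scores: Callable[[Sequence[int]], Dict[int, float]],  # hypothetical
    write_scores: Callable[[Dict[int, float]], None],             # hypothetical
    batch_size: int = 1000,                                       # assumed value
) -> int:
    """Recompute scores one batch at a time and return how many were written."""
    updated = 0
    for batch in batched(user_ids, batch_size):
        # One bounded read per batch instead of one query over every user.
        scores = compute_scores(batch)
        # Each write touches only a few rows, so it should not hold up
        # indexing writes for long.
        write_scores(scores)
        updated += len(scores)
    return updated
```

Calling `run_score_cycle` in an outer loop gives the "endlessly cycling" behavior: each full pass takes longer than a single bulk update, but no individual write is large.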
rickyrombo
left a comment
Finally got around to looking into this a bit...
tl;dr - I think I agree let's go forward with this and keep things moving and iterate on it later.
Truth be told I don't have much experience w/ this scoring query so it's hard to give good advice. I think I agree that some utility/intermediate tables might help. I also think I might agree with the endlessly cycling approach vs timer based. Generally agree we want to minimize job count creep, and realtime "streaming" updates sounds nice, but there's a point in queries like these where the cost of the updates on each request and the complexity of ensuring the proper triggers exceeds the benefit (similar to the Solana indexer things, where I'm also wary).
Might be a really dumb q: what does a read/calculation of a score look like for a single user? If it's fast enough (or can be made to be fast), maybe we do the calculation and update the score on read, like a cache (or even better yet, no score caching and just compute on demand). That would save a ton of wasted cycles on recalculating scores of inactive users...
Maybe something like:
- If the score hasn't expired, return it
- If the score has soft-expired, return it, and update the score after returning
- If the score is very expired, calculate first, return new result
I'm not generally aware of how the score gets used though, or if that's reasonable. If it's being used at the query level for things, that kinda falls apart - would need some separate app code probably... but yeah, forget all that, I say ship as-is. This is probably one of the more gnarly aggregates, as it doesn't have a lane and sort of reaches into all sorts of tables for signals, so of all the ones to be a job I think this one is worthy anyway.....
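For reference, the three-tier expiry idea above (fresh, soft-expired, very expired) could look roughly like this. It is only a sketch under assumptions: `ScoreCache`, the TTL values, and the `compute` callable are all hypothetical, and the "update after returning" step is modeled as a pending list that some background worker would drain.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List

SOFT_TTL = 300.0   # assumed: seconds before a score counts as soft-expired
HARD_TTL = 3600.0  # assumed: seconds before a score counts as very expired


@dataclass
class CachedScore:
    score: float
    updated_at: float


class ScoreCache:
    """Hypothetical read-through cache for per-user scores."""

    def __init__(self, compute: Callable[[int], float],
                 now: Callable[[], float] = time.time) -> None:
        self._compute = compute  # hypothetical single-user score query
        self._now = now
        self._entries: Dict[int, CachedScore] = {}
        self.pending_refresh: List[int] = []

    def _store(self, user_id: int) -> float:
        entry = CachedScore(self._compute(user_id), self._now())
        self._entries[user_id] = entry
        return entry.score

    def get(self, user_id: int) -> float:
        entry = self._entries.get(user_id)
        if entry is None:
            return self._store(user_id)       # never computed: calculate first
        age = self._now() - entry.updated_at
        if age >= HARD_TTL:
            return self._store(user_id)       # very expired: calculate first
        if age >= SOFT_TTL and user_id not in self.pending_refresh:
            self.pending_refresh.append(user_id)  # soft-expired: refresh later
        return entry.score                    # fresh or soft-expired: return as-is

    def refresh_pending(self) -> int:
        """Recompute everything queued by soft-expired reads."""
        pending, self.pending_refresh = self.pending_refresh, []
        for user_id in pending:
            self._store(user_id)
        return len(pending)
```

Inactive users never hit `get`, so their scores are never recomputed, which is the saving being described. As noted, this only helps when scores are read per user in app code, not when they are joined or sorted on inside a query.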
There are some expensive bits in this query that aren't consistent across users. Play history, reposts, and followers can all be huge or small. So maybe the score query returns immediately, or maybe it takes a few seconds. And it doesn't help that some of the conditions that invalidate your score are actions taken by other users. We could always set a minimum interval for a score update (only update if `updated_at` is more than 5 minutes old, or something like that). I think that's an avenue worth exploring for the second pass at this if streaming updates prove to be too complicated to implement in a maintainable/reliable way.
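The minimum-interval check mentioned here is small enough to sketch. The function name and the exact 5-minute value are assumptions for illustration; the real check would more likely live in the update query's `WHERE` clause.

```python
from datetime import datetime, timedelta
from typing import Optional

MIN_INTERVAL = timedelta(minutes=5)  # assumed value, per "5 mins or something like that"


def should_update_score(
    updated_at: Optional[datetime],
    now: datetime,
    min_interval: timedelta = MIN_INTERVAL,
) -> bool:
    """Return True if the score was never computed or is older than the interval."""
    if updated_at is None:
        return True
    return now - updated_at >= min_interval
```

This caps how often any one user's score is recomputed, no matter how many invalidating events (including other users' actions) arrive in between.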
| GitGuardian id | GitGuardian status | Secret | Commit | Filename | |
|---|---|---|---|---|---|
| 21650187 | Triggered | Generic High Entropy Secret | e4e2f43 | solana/indexer/damm_v2/indexer_test.go | View secret |
| 21650188 | Triggered | Generic High Entropy Secret | e4e2f43 | solana/indexer/damm_v2/indexer_test.go | View secret |
| 1606950 | Triggered | Generic High Entropy Secret | e4e2f43 | solana/indexer/damm_v2/indexer_test.go | View secret |
| 21650189 | Triggered | Generic High Entropy Secret | e4e2f43 | solana/indexer/damm_v2/indexer_test.go | View secret |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secrets safely. Learn here the best practices.
- Revoke and rotate these secrets.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future consider
- following these best practices for managing and storing secrets including API keys and other credentials
- install secret detection on pre-commit to catch secret before it leaves your machine and ease remediation.
@raymondjacobson @rickyrombo Latest changes clean it up a bit and move some of the features we use in the query to be precomputed and updated in a streaming fashion. Not quite ready for on-the-fly usage, but it should be faster!
### Description
Being replaced by: AudiusProject/api#488
### How Has This Been Tested?
Lots of manual testing on the new indexer against prod data.
This is an attempt to do a less disruptive update to `aggregate_user.score`. The legacy DN implementation ran an update query that would compute all user scores at once and then update them in the `aggregate_user` table.
This is bad for at least the following reasons:
- `get_user_scores` was also doing some stuff per-user that made the query a little inefficient.
New plan is thus:
- `distinct` is really slow here), then pushes the updated ids/scores into the write replica.
Some additional features added after PR feedback:
Testing on a local machine with a prod replica as the data source, it takes ~3 minutes, but that's without the new index on `aggregate_user`.
This is a halfway step between our existing slow query and something that can update scores on a streaming basis.